library(tidyverse)
library(ggplot2)
library(tidyr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Final Project Pooja Shah
Introduction
Dataset(s) Introduction:
I will be using a kaggle dataset about Superstore Sales for Sales Forecasting [https://www.kaggle.com/datasets/rohitsahoo/sales-forecasting]. The dataset has information about sales in a retail store and I will use that data to forecast sales pattern.
The dataset has almost 10k rows of information about sales and their 18 attributes. Each row consists of information such as the ordered goods information, shipment information, customer information and sales information.
What questions do you like to answer with this dataset(s)?
My main aim with this dataset is to understand the trends within the data and predict on the amount of goods that will be sold in the nearby future.
This will lead to knowing how much surplus amount of goods a company should have ready to make sure there isn’t a shortage in the stores in the future.
Describe the data set(s)
- displaying the data
data <- read_csv("PoojaShah_FinalProjectData/_data/data.csv")
data- descriptive information of the dataset
dim(data)[1] 9800 18
colnames(data) [1] "Row ID" "Order ID" "Order Date" "Ship Date"
[5] "Ship Mode" "Customer ID" "Customer Name" "Segment"
[9] "Country" "City" "State" "Postal Code"
[13] "Region" "Product ID" "Category" "Sub-Category"
[17] "Product Name" "Sales"
# Modifying Column Names
colnames(data)[3] = "OrderDate"
colnames(data)[5] = "ShipMode"
colnames(data)[16] = "SubCategory"
colnames(data) [1] "Row ID" "Order ID" "OrderDate" "Ship Date"
[5] "ShipMode" "Customer ID" "Customer Name" "Segment"
[9] "Country" "City" "State" "Postal Code"
[13] "Region" "Product ID" "Category" "SubCategory"
[17] "Product Name" "Sales"
- cleaning the data;
# Modifying data to split dates
dateData <- data %>%
separate_wider_delim(OrderDate, '/', names = c("Date", "Month", "Year"))
dateDatadateData$Month = as.numeric(as.character(dateData$Month))
dateData$Year = as.numeric(as.character(dateData$Year))
# Making a new dataframe with only necessary colums
newData <- dateData %>%
select(c("Month", "Year", "ShipMode", "Segment", "Country", "State", "Region", "Category", "SubCategory", "Sales"))
newData- summary statistics of data:
summary(newData) Month Year ShipMode Segment
Min. : 1.000 Min. :2015 Length:9800 Length:9800
1st Qu.: 5.000 1st Qu.:2016 Class :character Class :character
Median : 9.000 Median :2017 Mode :character Mode :character
Mean : 7.818 Mean :2017
3rd Qu.:11.000 3rd Qu.:2018
Max. :12.000 Max. :2018
Country State Region Category
Length:9800 Length:9800 Length:9800 Length:9800
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
SubCategory Sales
Length:9800 Min. : 0.444
Class :character 1st Qu.: 17.248
Mode :character Median : 54.490
Mean : 230.769
3rd Qu.: 210.605
Max. :22638.480
newData %>%
select(Sales) %>%
sapply(sd) Sales
626.6519
The dataset consists of almost 10K rows of data, all of which are already cleaned. The data is about sales data for a superstore containing sales information about different categories as well as sub-categories. All the sales are in United States. The details also include the details such as segment of order to describe if the order was a consumer order, corporate order or of some other type. The location from which the order was placed is divided into city, state, postal code as well as region. Our main aim for this project is to be forecasting the sales for different categories in different regions.This is why, we are aggregating the data to have precise results.
The Tentative Plan for Visualization
- Briefly describe what data analyses and visualizations you plan to conduct to answer the research questions you proposed above.
I plan to visualize the data in different manners so as to understand the different proportions in which the data is divided. This will help visualize the data as well as better understand what the data is tring to show.
- Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?
I tried to form different bar graphs that would visualize different combinations of data. I tried to cross the data columns available and provide graphs for all the aspects that seemed to be important to me. Based on this data, I decided the final columns which I will be considering for my evaluation.
ggplot(data, aes(Region, fill = Segment)) +
geom_bar() +
labs(title = "Sales per Region by Segment") 
ggplot(data, aes(Region, fill = Segment)) +
geom_bar() +
labs(title = "Sales per Region by Segment divided by States") +
facet_wrap(vars(State)) 
ggplot(data, aes(Category, fill = Segment)) +
geom_bar() +
labs(title = "Sales by Category per Segment")
ggplot(data, aes(Category)) +
geom_bar() +
labs(title = "Sales by SubCategory per Segment") +
facet_wrap(vars(Segment))
ggplot(data, aes(Segment)) +
geom_bar() +
labs(title = "Sales by Segment per SubCategory") +
facet_wrap(vars(SubCategory))
- If you plan to conduct specific data analyses and visualizations, describe how do you need to process and prepare the tidy data.
As mentioned above, my main aim is to forecast the sales for different categories in different regions. This is why I am mutating the dataset to have the information in the form needed.
- (Optional) It is encouraged, but optional, to include a coding component of tidy data in this part.
totalSales1 <- newData %>%
select(Month, Year, Region, Category, Segment, Sales)
totalSales1- Here, I am using 6 columns of the data that seemed important for my analysis. These columns are as under:
colnames(newData) [1] "Month" "Year" "ShipMode" "Segment" "Country"
[6] "State" "Region" "Category" "SubCategory" "Sales"
totalSales <- totalSales1 %>%
group_by(Month, Year, Region, Category, Segment) %>%
summarise(across(c(Sales), sum))
totalSalesggplot(totalSales, aes(Region, weight = Sales)) +
geom_bar() +
labs(title = "Total Sales by Region per Category") +
facet_wrap(vars(Category))
Evaluation
You will be evaluated on both the quality of your source code and your written report, with a greater emphasis on the clarity and details of the description of your dataset(s) and your research questions.
model <- lm(Sales~Month+Year+Segment+Region+Category, data = totalSales)
print(model)
Call:
lm(formula = Sales ~ Month + Year + Segment + Region + Category,
data = totalSales)
Coefficients:
(Intercept) Month Year
-355712.3 141.1 176.8
SegmentCorporate SegmentHome Office RegionEast
-740.6 -1162.4 437.8
RegionSouth RegionWest CategoryOffice Supplies
-198.5 513.9 -106.1
CategoryTechnology
278.9
cat("# # # # The Coefficient Values # # # ","\n")# # # # The Coefficient Values # # #
a <- coef(model)[1]
print(a)(Intercept)
-355712.3
XMonth <- coef(model)[2]
XYear <- coef(model)[3]
XSegmentCorporate <- coef(model)[4]
XSegmentHomeOffice <- coef(model)[5]
XRegionEast <- coef(model)[6]
XRegionSouth <- coef(model)[7]
XRegionWest <- coef(model)[8]
XCategoryOfficeSupplies <- coef(model)[9]
XCategoryTechnology <- coef(model)[10]#predicting sales for year 2019 in the month of July for Corporate in Region East for Category Technology
X1 = 7
X2 = 2019
X3 = 1
X4 = 0
X5 = 1
X6 = 0
X7 = 0
X8 = 0
X9 = 1
Y1 = a + XMonth*X1 + XYear*X2 + XSegmentCorporate*X3 + XSegmentHomeOffice*X4 + XRegionEast*X5 + XRegionSouth*X6 + XRegionWest*X7 + XCategoryOfficeSupplies*X8 + XCategoryTechnology*X9sales for year 2019 in the month of July for Corporate in Region East for Category Technology:
print(Y1)(Intercept)
2285.047
#predicting sales for year 2019 in the month of July for Corporate in Region East for Category Office Supplies
X1 = 7
X2 = 2019
X3 = 1
X4 = 0
X5 = 1
X6 = 0
X7 = 0
X8 = 1
X9 = 0
Y2 = a + XMonth*X1 + XYear*X2 + XSegmentCorporate*X3 + XSegmentHomeOffice*X4 + XRegionEast*X5 + XRegionSouth*X6 + XRegionWest*X7 + XCategoryOfficeSupplies*X8 + XCategoryTechnology*X9sales for year 2019 in the month of July for Corporate in Region East for Category Office Supplies:
print(Y2)(Intercept)
1900.07
#predicting sales for year 2019 in the month of July for Corporate in Region South for Category Technology
X1 = 7
X2 = 2019
X3 = 1
X4 = 0
X5 = 0
X6 = 1
X7 = 0
X8 = 0
X9 = 1
Y3 = a + XMonth*X1 + XYear*X2 + XSegmentCorporate*X3 + XSegmentHomeOffice*X4 + XRegionEast*X5 + XRegionSouth*X6 + XRegionWest*X7 + XCategoryOfficeSupplies*X8 + XCategoryTechnology*X9sales for year 2019 in the month of July for Corporate in Region South for Category Technology:
print(Y3)(Intercept)
1648.761
#predicting sales for year 2019 in the month of July for Corporate in Region South for Category Office Supplies
X1 = 7
X2 = 2019
X3 = 1
X4 = 0
X5 = 0
X6 = 1
X7 = 0
X8 = 1
X9 = 0
Y4 = a + XMonth*X1 + XYear*X2 + XSegmentCorporate*X3 + XSegmentHomeOffice*X4 + XRegionEast*X5 + XRegionSouth*X6 + XRegionWest*X7 + XCategoryOfficeSupplies*X8 + XCategoryTechnology*X9sales for year 2019 in the month of July for Corporate in Region South for Category Office Supplies:
print(Y4)(Intercept)
1263.783
#predicting sales for year 2019 in the month of July for Corporate in Region West for Category Technology
X1 = 7
X2 = 2019
X3 = 1
X4 = 0
X5 = 0
X6 = 0
X7 = 1
X8 = 0
X9 = 1
Y5 = a + XMonth*X1 + XYear*X2 + XSegmentCorporate*X3 + XSegmentHomeOffice*X4 + XRegionEast*X5 + XRegionSouth*X6 + XRegionWest*X7 + XCategoryOfficeSupplies*X8 + XCategoryTechnology*X9sales for year 2019 in the month of July for Corporate in Region West for Category Technology:
print(Y5)(Intercept)
2361.127
#predicting sales for year 2019 in the month of July for Corporate in Region East for Category Office Supplies
X1 = 7
X2 = 2019
X3 = 1
X4 = 0
X5 = 0
X6 = 0
X7 = 1
X8 = 1
X9 = 0
Y6 = a + XMonth*X1 + XYear*X2 + XSegmentCorporate*X3 + XSegmentHomeOffice*X4 + XRegionEast*X5 + XRegionSouth*X6 + XRegionWest*X7 + XCategoryOfficeSupplies*X8 + XCategoryTechnology*X9sales for year 2019 in the month of July for Corporate in Region West for Category Office Supplies:
print(Y6)(Intercept)
1976.149
tab <- matrix(c(Y1, Y2, Y3, Y4, Y5, Y6), ncol=3, byrow=TRUE)
colnames(tab) <- c('East','South','West')
rownames(tab) <- c('OfficeSupplies','Technology')
tab <- as.table(tab)
tab East South West
OfficeSupplies 2285.047 1900.070 1648.761
Technology 1263.783 2361.127 1976.149
Sales Forecast for July 2019